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ABSTRACT 


Although vector processors have been available for over a decade, the software neces- 
sary to make effective use of these facilities has generally not been available. In this paper 
we describe the design of a compiler for the programming language APL that produces 
code which allows most operations to be performed in parallel. This design is based on a 
similar sequential compiler currently in use. Both compilers use a technique called 
dragthrough to reduce the size of intermediate values generated during the course of execu- 
tion. 
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1. Introduction 


Machines that can execute several arithmetic operations in parallel have existed for some time 
[8, 12, 19, 20]. Nevertheless, the software needed to make use of this ability has been slow in developing. 
Most current high level languages, of the Algol-FORTRAN-Pascal variety, are not designed for the expres- 
sion of parallelism. Thus the attempt to use parallelism has taken two general approaches. The first is to 
analyze source code statically in an attempt to recognize operations (such as loops) that could potentially be 
executed in parallel [3]. As might be expected, this process is very difficult and has met with only limited suc- 
cess. A second approach is to develop new programming languages with explicit parallel instructions [16]. 
The acceptance of new languages is slow, however, and this has the drawback of producing yet another pro- 
gramming language that is available on only a limited number of machines in a limited number of locations. 


Yet there is one programming language that has been in relatively widespread use for a number of years, 
has existing implementations on a number of machines, and for which the recognition of potential parallelism 
is relatively easy. This language is APL [17]. 


APL is usually thought of as an interpretive language. However, in [5, 6] we have described the implemen- 
tation of a working APL compiler, running on a VAX-780 under UNIX . In the course of developing that 
compiler, we recognized the possibility of exploiting the parallelism implicit in the language, and the current 
work is the result of that effort. 


2. Machine Model 


The parallel APL compiler produces code for a hypothetical vector machine. This abstract machine is 
similar to existing vector machines, such as the CRAY-I [8], or the CDC STAR-100 [12]. In addition to the 
usual repertoire of scalar instructions, we assume the existence of vector-vector instructions (such as adding 
two vectors together to produce a third vector) and vector-scalar instructions (such as adding a scalar to each 
element of a vector). In order to be realistic, we assume a fixed upper limit (say 64) on the size of vector opera- 
tions. Each vector operation takes an argument that indicates the size of the vector being acted upon. 


In addition to arithmetic vector commands, we will also make use of the following commands. The first 
command takes a vector of addresses (which need not be contiguous) and produces a vector of values from the 
corresponding positions in memory. The CRAY-I has a simplified form of this command, where the 
addresses must be in an arithmetic progression. As we shall see, such progressions are oftentimes (although 
not always), generated by the APL compiler. In any case, we shall assume the more general capability. The 
second set of instructions are similar to those found on the CDC-STAR. One takes a vector “mask” and a 
vector argument and returns the vector which corresponds to the one bit positions in the mask (in APL termi- 
nology, the compression). Another instruction counts the number of one-bits in a boolean vector mask. A 
final instruction merges two vectors together under the control of a vector mask. For example given the mask 
010011 and arguments {1,2,3} {5,6,7} the command would produce the vector {1,5,2,3,6,7}. 


3. Compiling APL 


The structure of the parallel APL compiler is based on the sequential APL compiler described in [5, 6]. A 
few restrictions have been placed on the source language to insure that the meaning of every token can be 
determined at compile time. In addition, the user can optionally give declarations of variable type and rank, 
in which case better code can be generated. There are no restrictions placed on variable rank or shape. A 
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complete description of the source language is contained in [6]. 


As first noted by Abrams [1] the execution of many APL operations” can be deferred, and their loops 
combined, to produce very space efficient code. For example to compute the result A — B + C * Dasingle 
loop can be constructed that requests elements from B, C and D one by one, operates on them, and stores the 
result. This idea (called dragthrough) is used extensively throughout the sequential APL compiler. The paral- 
lel APL compiler uses a similar technique, requesting a vector of values that are acted upon in parallel to pro- 
duce a vector of results. 


In both the sequential and parallel APL compilers each operation in an expression typically generates 
three different code fragments. The first fragment consists of code to check argument types, ranks and shapes 
for conformability, and to reserve any temporary storage necessary. The second code fragment generates 
some subportion of the required values. The final code fragment releases any temporary storage allocated in 
the first part. Parallelism has very little impact on the first or third of these code fragments, and thus the code 
generated for these tasks is identical to the code generated by the sequential APL compiler. An explanation 
and example of the code generated to check conformability is given in [6] and will not be discussed further 
here. 


The interaction of the various phases of code generation, while simple for any particular APL operation, 
are more complex when the code for an entire expression is viewed as a whole. Many operations, such as 
reshape, take, iota or function calls, require the evaluation of one or both their operand values before their 
own shape can be determined. Thus the code generator cannot simply make three passes over the parse tree. 
The interaction of the various phases is explained fully in [6]. For the remainder of this paper we will concen- 
trate only on value producing code. 


The value producing code for any APL operation can be thought of as responding to a request for some 
subportion of the results to be computed by the subexpression rooted at that node. The operation may in turn 
transform this into another request which is passed on to one or both of its arguments (Figure 1). Eventually 
results will be produced by the arguments, and will be transformed by the operation before being passed back 
to the function which generated the original request. 


INN | requests 
Uf \, t values 


Figure 1: Requests propagate down, values propagate up 


For the most part, results will be requested from a subexpression in a very structured fashion. Assume an 
APL. expression produces a result A of rank n. We define a column to bea vector formed by fixing all but one 
dimension; in APL notation a column is described as A[r 23: 3+, ] for some fixed constants r, 
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duct or reduction, We will use the terms function or operation as 
primitive or composed of two or more functions. 
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will denote a) is the position of the free axis. The column start (denoted ) is the offset position of the element 
AL[r 3ras--ifg 13057 413---fn ] in the ravel sequence of A. Finally the column step (denoted 6) is the distance (in 
ravel sequence) from one element in a column to the next. Thus the ravel position of the kth element in a 
column is B+K* 6. (We will use zero based indexing throughout). 


A subexpression can receive requests for values in one of three forms. An Arithmetic Progression Vector 
(APY) is characterized by five quantities. Besides the column axis, start and step (a, Band 6) there is a vector 
offset, denoted by a value o between 0 and (pA)[a] — I, and a vector length (w). The values being requested 
are column positions o through o+ w— 1. 


A column vector (CV) consists of a column description (a, B and 6), a vector (v) of column offsets, and a 
vector length (w). Each offset is a value between 0 and (pA)[a}-1. The statement vy — 0 + (cw) — 1 can be used 
to convert an arithmetic progression vector to a column vector. Notice that since w must be less than 64, we 
can assume the quantity (cw) — | is a compile time constant, and thus this conversion can be accomplished in 
one or two vector operations. 


An offset vector (OV) is a vector (6) of offset positions in ravel order. The statement 6—- B+6* vcon- 
verts a column vector into an offset vector. Notice again this corresponds to two or three vector operations. 


The particular form used to request values from a subexpression is determined at compile time. In the 
sequel we shall see how this request is altered by various APL operations. A few operations (assignment in 
particular) generate loops to gather values. These loops always generate an APV, it being the most efficient 
request for values. For example, assignment produces the following code. (We will continue to use an abstract 
description language vaguely similar to both APL and iC): 


for (B = 0; B < limit; B = B + step) 
for (o = 0; o < step; o = 0 + 64) { 
w = min(step - 0, 64) 
code to compute results of APV 
store results 


4. Code Generation 


We will divide APL functions into three groups. The first group are those functions that have a space effi- 
cient vector implementation. A special subgroup of these are the so-called grid selectors. Guibas and Wyatt 
[11] have presented a clever space efficient implementation of these functions in the sequential case. We have 
elsewhere [7] presented extensions to this method that remove some of the restrictions imposed in the original 
paper. We will here show how the Guibas method can be implemented in the context of parallel requests. 


The second group of functions will be those that cannot be executed in parallel, but which still have space 
efficient sequential implementations. Given a request for a vector of values, these will convert the request into 
a sequence of requests for individual elements, combine the results, and pass the vector of results back to the 
originating function. 


The third group of functions will be /eaves and pseudo leaves. These are either terminal objects in the 
parse three (such as variables or constants) or operations (such as sort) that cannot be executed in a space effi- 
cient element by element manner. Table | lists the category for each of the APL functions we will discuss. 


Space Efficient Parallel Operations 


Catenate 

Compression 

Expansion 

lota 

Outer Product 

Reduction 

Reshape and Ravel 

Reversal 

Rotation 

Monadic and Dyadic Scalar Operators 
Subscripts 

Monadic and Dyadic Transpose 
Take/Drop 


Sequential Operations 


Dyadic lota 

Go arrow 

Grade up and Grade down 
Inner Product 
Membership 

Scan 


Space Inefficient Operations — 


Assignment 
Constants 

Deal and Roll 
Decode and Encode 
Function Calls 
Quad Input/Output 
Shape and Rank 
Variables 


Table 1: APL Operations by categories 


4.1 Space Efficient Operations 


A large number of APL operations do not alter data values themselves, but instead alter the way data is 
accessed. Examples of such functions include transpose, compression, reshape or catenation. These opera- 
tions can be thought of as acting on the request data, and not on the result data. Other operations, such as 
inner and outer product or reduction, operate on both request and result values. Finally scalar operations, 
such as plus, times or absolute value, act only on result data. 


Assume we are generating code for a subexpression producing a result A of rank n. Let the'vector s; 
represent the shape of A. The vector e},...,¢, defined by the recursion 


computed by offset =a,e, tae, +--+ +4, e 
Using the expansion vector definition, we note the following identities: 
Oy ey + 7+ FG, ey = offset mod e-, k >I Il 


aye; +++ +a, & = ex * (offset div e&) M2 


From I.1 we get that fork >1 
a, =((offset mod e,-,) — (offset mod e )) / & 
Since the result is integral and offset mod e, is strictly less than e, , this reduces to 
a, = (offset mod e,_\) dive k >I 13 
a, = offset div e, 
To move m steps along dimension k corresponds to adding m * e, to an offset. Suppose, however, that in 
addition to moving from position a to a’, along dimension k, that dimension is also being expanded or con- 


tracted (as in the APL operations expansion or compression). Let the expansion vector for the second object 
be represented by e ‘;. (Note that e ; =e, fori >k). Thus we need to discover the difference between: 


offset |=ayeyt-+: tae ++. ta, e, 14 
Offset2=aye';t++°> tae, to +a,e', 1.5 
Consider the subexpression 
ayer tor Faye 1.6 


& -) is a factor in each of the e terms, hence can be distributed out, yielding a value e _;X for some quantity 
X. Bute’; =(e; div e&)* e,. Thus 


aye tes tay ye) =e, 1X 1.7 
We can therefore rewrite 1.4 and 1.5 as 
Offset 1 =e,.X +a ea +Y 
Offset 2=e',,X tae +Y 
The difference between offset | and offset 2 is then 
(0-1 —% DX +a, — a deg 
Since e 4, -, =s , e, , we can factor an e, out of both the leftmost terms, producing 
(8 — 5X +(a — ay Ne 
Comparing I.7 to 1.2 we have that X = offset div e, _, giving us our final form for the difference in offsets 
(8, = 5 Moffset div e 4) +(a’, — a )) & 1.8 
We use these results to construct a number of primitive operations. Although we will write these as if they 
were procedure calls, in the compiler they are implemented as code generation templates producing in line 


code. Frequently, the values produced by these primitives can be computed at compile time, or at worst once 
during the shape phase. 


The first primitive, expvec(s, k), takes a shape vector and a position and computes the single element c 
from the expansion vector. 

The next primitive, elepos(offset, shape, k), takes an offset position, a shape vector, and an axis position. 
and returns the axis element in the subscript which corresponds to the offset position, using equations 1.3. 
The parameter offset can be a vector of values, in which case a vector result is produced. 

The routine chngpos takes an offset, a shape vector, a dimension k, and two scalars. The first scalar is the 
amount dimension k has been altered, either positively or negatively. The second is the amount of the desired 
movement along axis k. This is simply an encoding of equation 1.8. 


chngpos(s, 0, k, c, d) = (c * (offset / expvec(s, k-1)) + d) * expvec(s, k) 


Note that the offsets to elepos and chngpos can be vectors, in which case vector results are computed. 


4.1.1 Compress 


Suppose we are generating code for the compression A — V/[k] B. During the shape phase the left argu- 
ment V is converted into a map: 


map[i] = position of the i” one bit in argument. 
There are then four cases for compress. 
Case I: It can be determined at compile time that a # k, that is the request axis is not equal to the axis of 


operation, and request is by APV or CV. In this case only the starting position B for the column needs to be 
altered. 


i = elepos(B, pA, k); 
=pA 

p A)Lk] - (p Bk] 

d =i - map[i) 

B’ = chngpos(s, B, k, c, d); 


Case 2: It can be determined at compile time that a == k and request is by APV or CV. If an arithmetic pro- 


gression vector is used it is first converted into a column vector. The offsets v are then altered by indexing into 
the map. 

v= map[v] 
Case 3: Request is by APV or CV but it cannot be determined at compile time whether a ==k. In this case the 


APYV is converted into a column vector, and an if statement is generated to surround the code generated 
for cases I and 2. 


Case 4: Request is by offset vector. In this case a vector is produced corresponding to the subscript values of 
the offsets along the k axis, and this is used to transform the offsets. 


iv = elepos(6, pA, k); 


s=pA 
¢ = (p A)[k] - (p B)[k] 
dv = iv - map[iv] 


6 = chngpos(s, 6, k, c, dv); 


4.1.2 Expansion 


During the shape phase the left argument is converted into two maps. Mapais simply the bit vector value of 
the argument. Mapb contains a zero for each zero in the argument, and each one bit in the argument is 
replaced by a number representing the number of one bits to the left of the position. 


Case I: If request is by APV or CV and it can be determined at compile time that a k then only the valuc B 
needs to be tested. If it corresponds to a zero in the left argument then a vector of identity elements is 
returned, otherwise 8 is modified using mapb as in compression. 

Case 2: As noted in section 2, we will assume vector commands that correspond toa simplified form of the APL. 
function compress, and to merge two vectors together. The request is first converted into an offset vec-tor. The 
compression operator is then used to produce only the necessary elements (those that correspond to nonzero 
clements in the map). This offset vector is modified as in the compression code, using mapb for the map. 

Following the return of the results, mapa is used to merge elements together with a vector of identity cle-ments, 


iv = elepos(6, p A, k) 

6 =compress(mapa[iv], 8) 

computation of 8 as in compression 

compute result 

res = merge(mapD[iv], identity(assign, w), res) 


4.1.3 Dyadic and Monadic Scalars 


A request of a scalar function is simply passed unchanged to the argument subexpressions. When values 
are returned, they are operated on and passed back to the node generating the original request. The possibility 
of scalar extension is determined during the shape phase, and appropriate scalar/ vector commands issued. 


Scalar functions that are surrounded on both top and bottom by grid selectors can frequently be merged 
into the grid selector code. The algorithm for doing this is discussed in [7], and will not be presented here. 


4.1.4 Reduction 


Let B represent the subexpression being reduced, and let k represent the axis along which the reduction 
will take place. Reduction generates it own loop to scan and operate on a column of its subexpression. As we 
shall see, the loop only modifies the request values, and all results are generated in parallel. 


The following function (again, the compiler would actually generate in line code), takes an offset (either 
scalar or vector) and returns the corresponding offset in B positioned at the start of the column being reduced 
along. 


fl(off, shape, k) 
i= (off / expvec(shape, k)) * shape[k] + off mod expvec(shape, k) 
return(i) 


The second function takes a delta value (the amount needed to move along a specific axis) and returns the 
corresponding delta value in B. 


f2(delta, alpha, shape, k) 
if (alpha < k) 
delta = delta * shape[k] 
return(delta) 


Case I: Request is by APV or CV. The reductions can all be done in parallel simply by changing the values of 
the column start. 


& = £2(6, a, p B, k); 
res = identity(op, w); 
step = expvec((p B), k); 
i =( B) [k] 
B =f1(B) + i * step; 
for (; i > 0; i-) { 
compute subresult res’ from B 
res = res’ op res 
B’ = B’ - step; 
} 


Case 2: Request is by offset vector. The reductions can all be done in parallel by changing the offset vector. 


res = identity(op, w); 
step = expvec((p B), k); 
i = (o B) [k] 
6 =f1(6, p B, k) +i * step; 
for (; i > 0; i-) { 
compute subresult res’ from B 
res = res’ op res 
0 = @ - step; 
} 


4.1.5 Rotation 
Let the rotation be represented by BO[k] C. 
Case I. Request is by APV or CV and it can be determined at compile time that a ==k. In this case the entire 


column is rotated by a fixed amount. That amount is represented by i (the computation to determine i is 
straightforward and need not be given). The offsets of the column vector are then altered as follows: 


n =(v + i) mod (p C)[k}; 
v' = vy +(n- i) * expvec(p C, k); 


Case 2: The request is first changed into an offset vector. These offsets are converted into corresponding 
offsets for B, producing a vector iv of amounts to rotate each element by. 


jv = elepos(6, p B, k); 
nv = (jv + iv) mod (p C)[k}; 
& = @ + (nv - iv) * expvec(p C, k); 


4.1.6 Reshape and Ravel 

There is no guarantee that a column in the right hand argument to reshape will correspond to a column in 
the final result. Nevertheless, offset positions do correspond. Thus reshape merely converts its request into an 
offset vector which it pa: to its child. It returns the value returned by its child unchanged. 


Ravel is a simple subcase of reshape. 


4.1.7 Grid Selectors 


In [11] Guibas and Wyatt show how to generate very efficient code for the functions take, drop, monadic 
and dyadic transpose, and reversal. In [7] we show how to handle some special cases not treated in the origi- 
nal, and how to extend the method to increase its applicability. 


With one exception the grid selectors have the property that a column in the argument expression is 
mapped into a column in the result expression. The one exception is the special reducing form of dyadic tran- 
spose, by which diagonals in the subexpression get mapped into columns of the result. In this case a request 
by APV or CV is first converted into an offset vector, which is then modified as below. Note that rotation is 
not considered a grid selector largely because it does not have this column preserving property. 


‘The Guibas method is too complex for a detailed explanation to be given here. It suffices to say that dur- 
ing the shape phase any sequence of grid selectors is transformed into an object called a stepper. A stepper 
's of four quantities, only one of which is important to the current discussion: an array q[I .. pp A] that 
ibes which dimension in the subtree expression the i” dimension in the grid expression represents. We 
actually use the inverse function, which we will denote q”'. The stepper is used to construct two vectors, 

delta‘and gamma, that are related to, but not identical with, our step value 6. The stepper also generated a 
_Scalar 7, which is the ravel offset position of the element which corresponds to the first element of the result 
value. 


Uhe following algorithm takes an offset to the topmost grid sclector and returns the offset by which the 


corresponding element is accessed in the subexpression structure. 


int guibastrans(off, s) 
newoff = 7 
for (i =ps- 1;i>0; i-) { 
newoff = newoff + off * delta[i] 
off = off / s[i] 
} 


Using these quantities, there are then two cases. 
Case I: Request is by APV or CV. 


Lal; 
B’ = guibastrans(B, p A); 
& = gamma{a’}; 


Case 2: Request is by OV. 
6’ = guibastrans(6, p A); 


4.1.8 Catenate 

Let k represent the axis along which catenation is to take place. If it can be determined at compile time 
that a *k then it suffices to merely decide which argument contains the results, since they must all be in one 
argument or the other. If such a determination cannot be made, then the request is first changed into an offset 
vector. A bit vector mask is then constructed showing which results come from which argument. New request 
vectors are then constructed for each argument. When the results are returned they are merged together to 
form the final result. Let const represent the shape of the left argument along the kth dimension. 


iv = elepos(@, p A, k) > const 

6 = compression(iv, 6) 

6” = (compression( ~iv, 6) - const) * expvec(p A, k) 
compute results 

res = merge(iv, resl, res2) 


4.1.9 Outer Product and Subscripts 
Let the outer product A be computed by B o.op C. 


Case I: If the request is by APV or CV then the result will always be computed by a vector from one argument 
and a scalar from the other, depending upon whether a < pp B or not. 


B’= B/ delta (p A, pp B) 
get result from B 

B” = B mod delta (p A, pp B) 
get result from C 

res = resl op res2 


Case 2: If the request is by offset vector two new offset vectors are generated. 


6’ =0/ delta (p A, pp B) 
get result from B 

6” = 6 mod delta (p A, pp B) 
get result from C 

res = resI op res2 


Subscripting can be considered to be a special case of outer product. A request is first passed to the sub- 
script expression, where the outer product operator multiplies the left result by the rank of the left expression 
and adds the right result. The final vector of results is then used as an offset vector into the subscripted 
expression. 


4.2 Sequential Functions 


A few APL functions, such as scan, membership or dyadic iota, cannot easily be executed in parallel. This 
is because the amount of work necessary to compute an element of the result is not constant from clement to 
clement. Consider the function membership, for example. There are various methods one could compute 
membership [2, 6], different methods providing advantages in time or space over other methods. In the 
sequential APL compiler the right argument is collected and sorted during the shape phase. During the value 
phase a request is passed unchanged to the left argument, and when the result is returned a binary search is 
used to determine whether the value is contained in the right argument. Since the number of steps in the 
binary search depends on the distribution of values, it would be difficult to execute in parallel with several 
values. 

Nevertheless, the algorithms used in the sequential APL compiler are important in reducing the amount of 
space used by intermediate results. Thus when a sequential function receives a request for a vector of values, it 
produces its own internal loop to compute the results one by one, using the same algorithms used in the 
sequential compiler. 


4.3 Leaves 


Leaves are either terminal objects in the parse tree, such as constants or variables, or they are functions 
that cannot be executed either in parallel or in a space efficient manner. Examples of the latter are user func- 
tion calls, or the sorting functions gradeup and gradedown. 


In the case of leaf functions, during the shape phase loops are generated to collect arguments. These loops, 
like the loops generated by assignment, can collect a number of values for each iteration of the loop. The 
results of the function are then computed in the conventional fashion, and are left in a temporary variable. (In 
some cases, such as scan or inner product, the computation can be optimized slightly by using parallel requests 
along a particular dimension. These optimizations are minor and will not be considered in detail here.) The 
temporary variable constructed by the operation then acts exactly like a program variable. 


During the value phase, a leaf is characterized by a starting location for the values stored in memory. 
(Other attributes, such as size and shape, are important during the shape phase, but are unimportant to the 
value producing code). As we noted in section 2, some machine instruction sets, such as that on the CRAY-1, 
can generated very efficient code for requests given in the APV form. In gencral, however, we will simply con- 
vert the request into an offset vector and generate the following: 


result = start[@] 


5. Usage Patterns 


Table 2 summarizes how request modes are altered as they get passed down a parse tree. Where two 
entries are given it implies different modes are used depending upon the relationship between the operation 
axis and request axis. Note that given an arithmetic progression vector, only a few operators are forced to 
generate the less efficient column or offset vectors. Thus, for example, in the classic idiom for generating 
primes less than n: 

(2=+7 0=(cn)o.Jen)/in 


-10- 


all requests in the computation are carried out by arithmetic progression vectors. 


Node Access APV OV ov 
Catenate APV/OV CV/OV OV 
Compression APV/CV cv OV 
Expansion APV/OV CV/OV OV 
Outer Product APV cv OV 
Reduction APV cv OV 
Reshape and Ravel OV Ov Ov 
Reversal APV cv ov 
Rotation CVv/OV CV/OV OV 
Monadic and Dyadic Scalar Operators APV CV OV 
Subscripts APV/OV CV/OV OV 
Monadic and Dyadic Transpose APV cv OV 
Take/Drop APV CV OV 


Table 2: Access Request Modification 


The outlook is even more promising when we consider usage patterns for APL operations. Saal and Weiss 
[18] examined 32 workspaces from various IBM installations and observed the usage patterns shown in Table 
3. Note that over 90% of APL operations consist of either primitive scalar operations or subscripting, both of 
which are very efficient in the current scheme. 


Operation Percentage of Use 
Scalar Primitives 73 
Subscripting 18 
User Function Calls 3.6 
Reduction 2.6 
Mixed Output 1.0 
Explicit Axis Specification 0.6 
Inner Product 0.4 
Outer Product 0.4 
Scan 0.1 


Table 3: Ap! Usage Patterns (from Saal [18]) 


6. Conclusions 


The use of a language on a parallel machine that also has implementations on several scalar computers has 
obvious advantages. Programs can be developed in one environment and then moved to the parallel machine 
after they are completed. This reduces the amount of computation time spent on the (typically costly) parallel 
processor. It is likely that such a process is to preferred in any case, since APL interpretive systems provide 
advantages in program design and debugging not possible with a compiler. 


Note that the more operations in an expression, the greater is the likelihood that dragthrough techniques 
can be successfully applied to reduce the amount of intermediate storage necessary to produce a result. Thus 
the infamous “one-liner”, considered almost an art form among supporters of APL [15] and strongly 
denounced by detractors of the language [9], is here shown to have a practical benefit. Of course one could 
distinguish between “physical” one-liners (programs written on one line) and “conceptual” one-liners (pro- 
grams written without looping structure). In the latter case simple data-flow techniques [14] can be used to 
determine statements that can be combined to facilitate dragthrough and to eliminate temporary variables 
that are not used elsewhere. 


We can even consider taking this analysis one step further. There is currently in the APL community a 


oy 


movement to structure programs in terms of a large number of single line functions (so called “direct defini- 
tion” form [13]). Stripped of the complexities of local name collision, one can imagine a preprocessor that 
could examine a collection of APL functions and expand function calls in line. this would in effect be trading 
greater object size for efficiencies in the compiled code. 

The basic ideas in the design of the parallel APL compiler have been used in the construction of a working 
sequential compiler for the VAX-780. Timings of this compiler indicate that the code it generates is competi- 
tive with or better than many APL interpreters. In addition, the extensive use of dragthrough techniques sig- 
nificantly reduces the amount of temporary storage needed in evaluating expressions. In practice temporary 
storage is frequently as great or greater a limitation than program size or execution time. 

On the negative side, as noted previously, there is no way a compiler can provide the same type of interac- 
tive design and debugging tools as an interpreter. Ideally, one would like to integrate the compiler and an 
interpreter, thus providing the advantages of both in a complete development environment along the LISP 
model [4, 10]. 
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